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We present a fast Markov Chain Monte-Carlo exploration of cosmological parameter space. We 
perform a joint analysis of results from recent CMB experiments and provide parameter constraints, 
including as, from the CMB independent of other data. We next combine data from the CMB, HST 
Key Project, 2dF galaxy redshift survey, supernovae la and big-bang nucleosynthesis. The Monte 
Carlo method allows the rapid investigation of a large number of parameters, and we present results 
from 6 and 9 parameter analyses of flat models, and an 11 parameter analysis of non-flat models. 
Our results include constraints on the neutrino mass (m v < 0.3 eV), equation of state of the dark 
energy, and the tensor amplitude, as well as demonstrating the effect of additional parameters on 
the base parameter constraints. In a series of appendices we describe the many uses of importance 
sampling, including computing results from new data and accuracy correction of results generated 
from an approximate method. We also discuss the different ways of converting parameter samples 
to parameter constraints, the effect of the prior, assess the goodness of fit and consistency, and 
describe the use of analytic marginalization over normalization parameters. 



I. INTRODUCTION 



There is now a wealth of data from cosmic microwave background (CMB) observations and growing amount of 
information on large scale structure from a wide range of sources. We would like to extract the maximum amount of 
information from this data, usually conveniently summarized by estimates of a set of cosmological parameter values. 
With high quality data one can constrain a large number of parameters, which in principle allows us not only to 
put estimates and error bars on various quantitative parameters, but also to address more fundamental qualitative 
questions: Do we observe primordial gravitational waves? Do the neutrinos have a cosmologically significant mass? 
Is the universe flat? Are the standard model parameters those that best account for the data? In addition we can 
also assess the consistency of the different data sets with respect to a cosmological model. 

Recent work involving parameter estimation from the CMB includes Refs. [jjj |[ ||, ^, ]|| [jj, |t], ||, ||, [To). In this paper 
we employ Markov Chain Monte Carlo (MCMC) techniques Jll], |lj, as advocated for Bayesian CMB analysis in 
Ref. and demonstrated by Refs. By generating a set of MCMC chains we can obtain a set of independent 

samples from the posterior distribution of the parameters given the data. From a relatively small number of these 
samples one can estimate the marginalized posterior distributions of the parameters, as well as various other statistical 
quantities. The great advantage of the MCMC method is that it scales, at its best, approximately linearly with the 
number of parameters, allowing us to include many parameters for only small additional computational cost. The 
samples also probe the shape of the full posterior, giving far more information that just the marginalized distributions. 

Throughout we assume models with purely adiabatic Gaussian primordial perturbations in the growing mode (our 
approach could easily be generalized to include isocurvature modes), three neutrino species, non- interacting cold dark 
matter, and standard general relativity. We compute all theoretical (CMB and matter power spectrum) predictions 
numerically using using the fast Boltzmann code CAMB |16 (a parallelized version of CMBFAST jl?]]). Our results are 
therefore limited by the accuracy of the data and could be generalized very easily to include any additional parameters 
that can be accounted for by modification of a Boltzmann code. 

In section [FJ we present constraints from the latest CMB data, illustrating the MCMC method. We defer a brief 
introduction to the MCMC method and a description of our implementation and terminology to Appendix [A| The 
method of importance sampling is also illustrated in section |lj and is described in detail Appendix |b], where we 
explain how it can be used to take into account different priors on the parameters, new data, and for accurate but fast 
estimation given a good approximation to the theoretical model predictions. We add large scale structure, supernova 
and nucleosynthesis constraints in section HI so that more parameters can be constrained. We compare flat models 
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FIG. 1: The CMB temperature anisotropy band-power data used in this paper. The line shows the model with the parameters 
at their mean values, given all data after marginalizing in 6 dimensions (i.e. first column of Table. Fll). 



with 'inflationary' priors (9 cosmological parameters) and then a more general models (11 cosmological parameters). 
This includes constraints on the neutrino mass, equation of state of the dark energy, and the tensor amplitude. In 
addition our results show the sensitivity of constraints on the standard parameters to variations in the underlying 
model. 



II. CMB CONSTRAINTS 



We use the results of the COBE g8), BOOMERANG @, MAXIMA ||, DASI @, VSA || and CBI @ || 

observations in the form of band power estimates for the temperature CMB power spectrum. These data arc plotted 
in Fig. 0. The very small angular scale results from CBI have been discussed extensively in [22, 24j and do not fit the 
linear predictions of standard acoustic oscillation models. Therefore for the purposes of this paper we assume that 
the small scale power observed by CBI has its origin in non-linear or non-standard small scale effects that we do not 
attempt to model, and so use only the mosaic (rather than deep pointing) data points throughout. In addition we 
use only the first 8 points (£ < 2000) from the odd binning of the mosaic fields since above that the noise in the band 
powers becomes much larger than the prediction for the class of models we consider. 

For COBE we use the offset- lognormal band-powers and covariance matrix from RAD PACK [0 . For DASI, VSA 
and CBI we also use the offset-lognormal band powers and integrate numerically over an assumed Gaussian calibration 
uncertainty. For BOOMERANG, MAXIMA we assume top hat window functions and uncorrelated Gaussian band- 
power likelihood distributions, and marginalize analytically over the calibration and beam uncertainties assuming they 
are also Gaussian . We assume a correlated calibration uncertainty of 10% on the CBI and VSA data (neglecting 
the ~ 3% uncorrelated difference in calibration), but otherwise assume all the observations are independent. Using 
the sets of samples obtained it is a simple matter to account for small corrections when the required information is 
publicly available (see Appendix |B|) . 

The base set of cosmological parameters we sample over are uj^ — fl\>h 2 and w c = ft c h 2 7 the physical baryon and 
CDM densities relative to the critical density, h — Hq/(100 km s _1 Mpc ), the Hubble parameter, = 1 — Otot 
measuring the spatial curvature, z ro , the redshift at which the reionization fraction is a half , A s , measuring the 
initial power spectrum amplitude and n s , the spectral index of the initial power spectrum. We derive the ratio 



1 The CMB temperature anisotropy is very insensitive to the duration of reionization epoch, and we also neglect the small effect of Helium 
reionization and inhomogencitics. 



FIG. 2: Left: 2000 samples from the posterior distribution of the parameters plotted by their f2 m and J1a values. Points are 
colored according to the value of h of each sample, and the solid line shows the flat universe parameters. We assume the base 
parameter set with broad top-hat priors. 

Right: bottom layer (green): supernova constraints; next layer up (red): CMB data alone; next (blue): CMB data plus HST 
Key Project prior; top layer (yellow): all data combined (see text). 68 and 95 per cent confidence limits are shown. 



of the critical density in the form of dark energy, from the constraint £Ik + Oa + O m = 1 (where Q m = il c + Ob is 
the total matter density in units of the critical density). Throughout we use at least the priors that 4 < z rc < 20, 
0.4 < h < 1.0, -0.3 < SIk < 0.3, S1a > 0, and that the age of the universe, t Q , is 10 Gyr < t < 20Gyr. The 
significance of this base set is that this defines the Bayesian priors: there is a flat prior on each parameter of the 
base set. We discuss later how we assess the significance of these priors, and highlight our main results, which are 
largely independent of the priors. (We choose h as a base parameter since the HST Key Project provides a direct 
constraint on this quantity, whereas there are no direct constraints on, e.g. Oa; see Appendix ^ for discussion). The 
above additional constraints on h, Oa, and the age have little effect on the joint results since the cut-off values 
are well into the tails of the distribution. However for the purpose of the Monte-Carlo it is very convenient to be able 
to quickly reject models in the extreme tails without having to compute the theoretical predictions. 

MCMC illustration 

An MCMC sampler provides an efficient way to generate a list of samples from a probability distribution (see 
Appendix [a| for an explanation). All that is required is a function for calculating the probability given a set of 
parameter values. A single sample is a coordinate in the n-D parameter space, and the sampling method ensures that 
the number density of samples is asymptotically proportional to the probability density. As an illustration, in the left 
hand panel of Figure ^ we show the values of Q m = Ob + O c , and Oa, for samples collected from an MCMC run using 
the CMB data and base parameters discussed in the previous paragraphs. 

Note that although O m is not one of our base set of parameters, it is simple to find probabilities as a function of O m 
by taking the parameter values of each sample and deriving the corresponding values of O m . Since the MCMC method 
produces samples from the full posterior, it follows that the number density of samples in this two-dimensional plane 
is proportional to the probability density of the two parameters (marginalized over all the other parameters; note that 
this refers to the fully marginalized probability density rather than any conditional or projected probability density). 
The familiar direction of CMB degeneracy along the flat universe line is apparent. The colors of the dots indicate the 
Hubble constant value for each sample, as given in the color bar. This shows that the high O m tail of samples are 
due entirely to low h regions of parameter space, illustrating the point made in e.g. fig ] that a Hubble constant prior 
alone combined with the CMB can put useful limits on O m and Oa without the need for supernova constraints. 

The likelihood as a function of position in the O m -OA plane from CMB data alone is shown by the red contours in 
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Parameter 


pre VSA/CBI 


+VSA+CBI 


all data 


tt h h 2 

n s 

vi = n m h 2 - 4 (a s e- T /0.7)-°- 85 
v 2 = a 8e - T (V0.7) a5 (n m /0.3)-°- 08 


0.0215 ± 0.0022 
0.985 ± 0.051 
0.114 ± 0.0052 
0.71 ± 0.07 


0.0216 ± 0.0022 
0.982 ± 0.050 
0.113 ± 0.0048 
0.70 ± 0.06 


0.0212 ± 0.0014 
0.980 ± 0.037 
0.111 ± 0.0038 
0.70 ± 0.04 



TABLE I: Marginalized parameter constraints (68 per cent confidence) on four well constrained parameters (see text) from 
two combinations of CMB data assuming a flat universe and varying [fl h h 2 , n c h 2 , h, n s , z la , A 3 ]. Extra digits are inserted to 
help comparison between pre and post VSA and CBI. For comparison the last column gives the results including HST, 2dF, 
BBN and SnIA data. 



the right hand panel of Figure |2|. Note that compared to some other CMB only plots the contours close slightly at 
high Q m , which is due to our lower limit on the Hubble constant of h > 0.4. The 2-D likelihoods shown are calculated 
from the samples by estimating the probability density at a grid of points using a Gaussian weighting kernel and 
then using interpolation to draw the contour lines. This produces a plot very similar to that obtained using a grid 
integration calculation of the marginalized posterior (assuming there are no small scale features in the posterior). 

Extra data can be taken into account quickly by re-weighting the samples, a technique known as importance 
sampling (described in detail in Appendix |b]). The posterior is simply re-calculated at each sample position, and a 
weight assigned to the sample according to the ratio of the new posterior to the old posterior. For example, using a 
Gaussian HST Key Project [^7| prior on the Hubble constant h = 0.72 ± 0.08 we obtain the blue contours plotted on 
the right hand panel of Figure^. By using the weighted number of points as a function of S~Ik we find the marginalized 
result fl k = 0.00 ± 0.03, and therefore that the universe is close to flat (given our assumptions). 

Using importance sampling we have checked that the CMB datasets are consistent using the hyperparameter 
method, as described in Appendix H. 



Quantitative constraints from CMB data alone 



Above we illustrated the MCMC method in the f2 m -fiA plane for a 7 parameter cosmological model. Since there is 
good observational evidence and theoretical motivation for flat models, we now fix Q,k = giving 6 base parameters. 

The only parameters that can be reasonably disentangled are Wb and n s , from the heights of the 2nd and 3rd 
acoustic peaks relative to the first. These constraints are given in the first two rows of Table | and are in accordance 
with previous similar analyses and nucleosynthesis constraints |28|. However note that these two parameters remain 
significantly correlated, so that more detailed information is contained in the result n s ~ (ujb~ 0.022)/0.043 = 0.98±0.04, 
n s + (cj b - 0.022)/0.043 = 0.98 ± 0.09. 

Qualitatively, having 'used up' constraints from the 2nd and 3rd acoustic peak heights to find fl^h 2 and n s we 
might estimate that there remain three more pieces of information in the CMB data, from the large scale (COBE) 
amplitude, the first peak height and the first peak position. We have four remaining parameters e.g. parameterised 
by us, 0, m , h and r (r is the optical depth to reionization 2 , and a% is the root mean square mass perturbation in 
8/i _1 Mpc spheres today assuming linear evolution). Since fl^h 2 and n s are not too correlated with these parameters, 
we marginalize over them and explore the remaining four dimensional parameter space. 

From the set of samples it is straightforward to perform an independent component analysis to identify the well 
determined orthogonal parameter combinations. Calculating the eigenvectors of the correlation matrix in the logs of 
<78e~ r , Q m and h we find the two combinations given in the bottom two rows of Table |. The errors are too large on 
any third constraint to be worth quoting. We expect cr 8 to be roughly degenerate with e~ T because the CMB power 
on scales smaller than the horizon size at reionization (£ > 20) is damped by a factor e~ 2r , and <r 2 scales with the 
power in the primordial perturbation. We find that marginalized values of v\ and i>2 are independent of r to better 
than 2 per cent for 0.02 < r < 0.14 (4 < z rc < 16). As demonstrated in Appendix ^ these constraints are almost 
independent of the choice of base parameters (which define the prior) . 

If we marginalize over eg and z rc we find the principle component fl m h 29 = 0.096 ± 9%, consistent with the 
constraint found and discussed in Ref. 0. However since this quantity is significantly correlated with the amplitude 



2 Assuming rapid reionization the optical depth can be calculated from z re using r = cry J Q Zro dzn e (z)/[(l + z) 2 H(z)], where n e is 
the number density of electrons and cry is the Thompson scattering cross-section. For flat models with cosmological constant r 
0.048 Ob hQm°' 47 ^re 42 (to a few percent over the region of interest), though we do not use this in our analysis. 
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we have quoted a tighter constraint (4%) in Table | by including the a$e~ T dependence in our results. 3 

While restricted to this relatively small parameter space we take this opportunity to investigate the impact of the 
new VSA and CBI results. Using importance sampling we compare the results with and without the VSA/CBI data 
in Table [|. For simplicity we assume the same power law approximation for the combination of <jg,e~ T , h and fl m as 
derived above. The peaks move by a fraction of the error, and the error bars are fractionally smaller. 



III. ADDITIONAL COSMOLOGICAL CONSTRAINTS 



The CMB data alone can only provide a limited number of constraints, so before extending the parameter space 
to make full use of the Monte-Carlo method it is useful to include as much relatively reliable data as possible. Care 
must be taken, since some published parameter constraints assume particular values for parameters that we wish to 
vary. As a simple example, the Supernova Cosmology Project has made available constraints in J7 m -f2A space, which 
could only be used if the dark energy is a cosmological constant. Fortunately the full supernova data are available, 
which we use (described below). However is is not always practical to use the full observed data and it may be best 
to simply increase the error bars to encompass systematic effects due to other parameters. This ensures that the 
resulting MCMC chains will cover all of the relevant parameter space, and can then be importance sampled later 
with a tighter constraint. If the chains are generated too tightly constrained one cannot recover information about 
the excluded parameter space. 

Nucleosynthesis constraints suggest cjb ~ 0.02 [p8|, and we assume the Gaussian prior cjb = 0.020 ± 0.002, (Ict) 
which is somewhat broader than the error quoted in Ref. |^8| to allow for differences with other estimations. We 
include Type 1A supernovae data from [p9| , using the effective magnitudes and errors from the 54 supernovae that 
they included in the primary fit (fit C). We marginalize analytically with a flat prior on the intrinsic magnitudes, which 
is equivalent to evaluating the likelihood at the best fit value (see Appendix |]). We neglect the small correlations 
but multiply the log likelihood by 50/54 to account for the effective degrees of freedom quoted in Ref. [2j|. We 
use the galaxy power spectrum from the first 147,000 redshifts of the 2dF galaxy redshift survey, using the scales 
0.02 < fc/(MVIpc _1 ) < 0.15 where non-linear effects were found to be negligible M. We assume that this is directly 
proportional to the matter power spectrum at z = 0, in other words that the bias and evolution are scale independent 
and also that the redshift space distorted power spectrum is proportional to the real space power spectrum (on the 
large scales used). We assume a flat prior on the proportionality constant and marginalize analytically as described 
in Appendix [f| We also use the HST Key Project prior on the Hubble constant as discussed earlier. 

The top contours in the right hand panel of Figure || shows the effect of the full set of constraints on the basic 7 
parameter model, with the combined constraint on the curvature becoming fix = 0.00 ± 0.02. For the 6 parameter 
flat models the extra data constrains most of the parameters rather well. Table || shows that the new constraints are 
very consistent with those from the CMB alone. The i>2 constraint on erg of Table | is slightly changed and becomes 
a%e~ T (ft,/0.67) a58 = 0.72 ± 0.05 almost independently of £l m . The marginalized results on all parameters are shown 
in Table [n]. The Hubble parameter is shifted to slightly lower values relative to the HST Key Project constraint we 
used. The matter and dark energy densities are spot on the popular 0.3, 0.7 model, although with error bars of 0.05 at 
68 per cent confidence. The combination Q, m h is slightly tighter but has the same central value as quoted in Ref. |3^] 
using 2dF data alone and assuming n s = 1. 

The us result depends on the range used for the reionization redshift z re since the data used mostly constrains 
the combination &gc~ T rather than <rg on its own. Our prior of 4 < z rc < 20 should be reasonably conservative, 
however we also quote the result for <7%e~ T . Looking at the maximum and minimum values contained in the 95 
percent confidence region of the full n-dimcnsional space (see Appendix C) we find 0.62 < cr 8 exp(0.04 — r) < 0.92. 
This may be compared to values of ag found by other independent methods and could in principle be combined with 
these methods to estimate the optical depth (e.g. see Ref. |)], ||l]]). If r = 0.04 and Q m = 0.3 then our result is very 
consistent with the new lower cluster normalisation found by Refs. p2] , ^3, [34, 3^] and just consistent with the cosmic 



shear measurements of Refs. |31|, 3J, |37j, p8| . The high clustering amplitude required to fit the small scale clustering 
observed by CBI of erg ~ 1 (Ref. [24]) is in the tail of the distribution and may require an optical depth rather larger 
than expected in simple models. 

By using subsets of the chains we have checked that the Monte-Carlo sampling noise is negligible at the accuracy 
quoted. The results are Monte-Carlo marginalized over all of the other parameters and also analytically or numerically 
marginalized over the calibration type uncertainties discussed in Section ||. 



Note that the fractional errors depend on the choice of normalization for the log eigenvectors; here we have chosen to normalize so the 
exponent of £7 m is unity. 
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We now demonstrate the power of the Monte-Carlo method by using the above data to constrain a larger number 
of parameters, using the proposal density described in Appendix |X] to exploit the differing computational costs of 
changing the various parameters. We consider separately the case of inflationary models, which are flat, and more 
general models less constrained by theoretical prejudice. In both cases we include f v , the fraction of the dark matter 
that is in the form of massive neutrinos 4 , and allow for an effective constant equation of state parameter 5 w = pj p 
for the dark energy, and assume that — 1 < w < 0. 



Inflationary models 



The simplest single-field inflationary models predict a flat universe and can be described quite accurately by the 
slow-roll approximation. The shape and amplitude of the initial curvature perturbation depends on the shape of 
the inflationary potential, often encoded in 'slow-roll parameters' which are assumed to be small, plus an overall 
normalization which depends on the Hubble rate when the modes left the horizon during inflation. The initial scalar 
and tensor power spectra are parameterized as usual by 6 



k S 



Ph(k) 



(1) 



where n s and n* are the conventional definitions of the spectral indices. At the lowest order approximation the slow- 
roll initial power spectra are determined from the inflationary potential V by the slow-roll parameter parameters e\, 
e2byp) 



A s = 



1&A S ~ 16tt 
f - 2ei 
H 2 



£2 



'"-PI 
8tt 



v 



7reimp[ 



n t = -2ei = - 
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where quantities are evaluated when Ha — k+ (we use k+ = k S Q = k t o = O.OIMpc -1 ). For our analysis we use the 
parameterization of Equations (|f|) and define t\ = A t /16A S . We also impose the slow-roll constraint that spectral 
index of the tensor modes is given by n t — — 2e±. Our results will be consistent with inflationary models in the region 
of parameter space in which ei <C 1, n s ~ 1, but elsewhere can be interpreted more generally (the results are not very 
sensitive to the tensor spectral index). From the definition it is clear that t\ > 0, and except in contrived models one 
also expects n s < 1, though we do not enforce this. Simple ekpyrotic models are consistent with this parameterization 
when there are no tensor modes fl44| . If there were evidence for tensor modes (ei > 0) then this would be direct 
evidence against simple ekpyrotic models. 

Figure^ shows the fully marginalized posterior constraints on the various parameters using the CMB, supernovae, 
HST, and nucleosynthesis constraints, with and without the 2dF data, generated from 7700 weakly correlated samples. 
We generate samples without the 2dF or CBI data, and then importance sample including CBI to compute results 
with and without the 2dF data. The constraints on fl m h and f v are sharpened significantly on adding in 2dF. The 
large shift in the o% distribution comes from the exclusion of the high /„ parameter space due to the new constraint 
on the shape of the matter power spectrum (see discussion of degeneracy below) . 

It is important to check that the parameters are really being constrained, in the sense that the results are relatively 
insensitive to the priors, so in addition to the marginalized posterior we also plot the mean likelihood of the samples. 
These will differ in general, particularly when the result is sensitive to the parameter space volume available, which 
can change as the result of choosing different priors (see Appendix O) . In most of the I-D plots the two methods are 



4 We assume three neutrinos of degenerate mass, as indicated by the atmospheric and solar neutrino oscillation observations |39| pj| , and 
compute the evolution using the fast but accurate method described in Ref. 

5 Many quintessence models can be described accurately by a constant effective equation of state parameter 1421] . We compute the 
perturbations by using a quintessence potential V((j>) with = — 1(1 — w)H<f> and = — 1(1 — w)[H — §(1~+ w)H 2 } that gives a 
constant equation of state. 

6 Here denned so (|x| 2 ) = / dlnfeP x (fe) and {hijh 1 !) = J dlnfc Pf l (fc), where x i s the initial curvature perturbation and hij is the 
transverse traceless part of the metric tensor. These definitions ensure P = const corresponds to scale invariant. 
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FIG. 3: Posterior constraints for 9-parameter flat models using all data. The top nine plots show the constraints on the base 
MCMC parameters, the remaining plots show various derived parameter constraints. Red lines include CMB, HST, SnIA 
and BBN constraints, black lines also include the 2dF data. The solid lines show the fully marginalized posterior, the dotted 
lines show the relative mean likelihood of the samples. The curves are generated from the MCMC samples using a Gaussian 
smoothing kernel l/20th the width of each plot. 



in good agreement indicating that the likelihood is well constrained in n-D space and the priors are not biasing our 
results. However the marginalized value of o% is brought down by the f v > phase space (since massive neutrinos 
damp the small scale power), even though the best fits to the data occur where the neutrinos are very light (the 
correlation is shown in the bottom right hand panel of Figure ^.) Similarly the marginalized value of n s is slightly 
increased by the phase space with ei > 0; this increases the CMB power on large scales, and hence requires a higher 
spectral index for the scalar modes (bottom left panel of Figure f|) . 

We also show in Figure [| that a small degeneracy between the Hubble constant and the matter density remains 
(top left) after the ~ fl m h constraint from the galaxy power spectrum shape is combined with the ~ il m h 3 CMB 
constraint (Table |). Geometrical information from the CMB peak position and the SnIA work together to constrain 
w, but this remains slightly correlated with Q m (top right). 

The results from the 6 and 9 parameter analyses can be compared using the 68 per cent limits given in Table [] 
and the plots in Figure [|. Many of the results are quite robust to the addition of the extra three degrees of freedom. 
The biggest change is in a%e~ T which is brought down by contributions from non-zero f v . 

As discussed in Appendix parameter confidence limits from the full n-D distribution can also easily be calculated 
from a list of samples. We show the marginalized and n-D parameter constraints with the inflationary assumptions 
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0.05 0.1 0.05 0.1 0.15 

1 V 

FIG. 4: All-data posterior constraints for flat inflationary models using. The contours show the 68% and 95% confidence limits 
from the marginalized distribution. The shading shows the mean likelihood of the samples, and helps to demonstrate where 
the marginalized probability is enhanced by a larger parameter space rather than by a better fit to the data (e.g. low n s values 
fit the data better). 





6 parameters 


9 parameters 




+2dF 


no 2dF 


±2dF 


±2dF 


±2dF 




68%- ID 


68%- ID 


68%- ID 


68%-full 


95%-full 


fv 




< 0.10 


< 0.04 


< 0.10 


< 0.13 


w 




< -0.87 


< -0.88 


< -0.68 


< -0.58 


ei 




< 0.032 


< 0.032 


< 0.069 


< 0.085 


m v /eV 




< 0.29 


< 0.14 


< 0.36 


< 0.54 


rio 




< 0.30 


< 0.31 


< 0.92 


< 1.4 


fl b h 2 


0.021 ±0.001 


0.022 ± 0.001 


0.022 ± 0.001 


0.018-0.025 


0.017 - 0.026 




0.113 ±0.008 


0.099 ±0.014 


0.106 ±0.010 


0.082 - 0.130 


0.072 - 0.142 


h 


0.67 ±0.03 


0.67 ±0.05 


0.66 ± 0.03 


0.59 - 0.75 


0.55 - 0.78 




0.98 ± 0.04 


1.02 ±0.05 


1.03 ±0.05 


0.91 - 1.13 


0.87- 1.19 


n A 


0.70 ± 0.04 


0.72 ±0.06 


0.71 ±0.04 


0.58-0.80 


0.54 - 0.82 




0.30 ± 0.04 


0.28 ±0.05 


0.29 ± 0.04 


0.20 - 0.42 


0.18 - 0.46 


to/Gyr 


14.1 ± 0.4 


14.3 ± 0.4 


14.1 ±0.4 


13.3 - 15.0 


13.0 - 15.2 




0.20 ± 0.02 


0.18 ±0.03 


0.19 ±0.02 


0.15-0.25 


0.13-0.26 




0.79 ± 0.06 


0.54 ±0.13 


0.67 ±0.08 


0.49 - 0.93 


0.45 - 0.95 


age~ T 


0.72 ± 0.04 


0.50 ±0.12 


0.61 ±0.07 


0.47-0.81 


0.41 - 0.84 




0.40 ± 0.05 


0.27 ±0.08 


0.34 ±0.05 


0.22 - 0.51 


0.19-0.53 



TABLE II: Parameter constraints for 6 and 9 parameter flat models with all data with or without 2dF. The top section shows 
the constraints on the additional parameters that were fixed in the basic 6 parameter model, the bottom half shows the effect 
these additional parameters have on the results for the basic parameters. ID limits are from the confidence interval of the fully 
marginalized ID distribution, the full limits give the extremal values of the parameters in the full n-dimensional confidence 
region (see Appendix |c| for discussion). Bold parameters are base Monte-Carlo parameters, non-bold parameters are derived 
from the base parameters. 
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in Table Gj. 7 As expected, the n-D limits are much wider than those from the marginalized distributions, most being 
more than twice as wide. 

The combined datasets provide good constraints on the neutrino mass, despite the large parameter space. The 
massive neutrino fraction /„ translates into the neutrino mass via 

V 2 =/A M fe 2 = |^ m,«310^ 2 eV, (5) 

where the last equality follows from our assumption that there are three neutrinos of approximately degenerate mass, 
as indicated by the small mass-squared differences detected by the neutrino oscillation experiments |3^, |4(| . At 95% 
confidence we find the marginalized result m„ < 0.27 eV and the more conservative n-D result m„ < 0.5 eV. The 
tightness of the constraint is predominantly due to the 2dF data, as shown in Figure |j| via the damping effect of 
massive neutrinos on the shape of the small scale matter power spectrum. 8 The result is consistent with the weaker 



limits found in Refs. jf^, 46 under more restricted assumptions. The marginalized result is only slightly affected by 
the large parameter space: computing chains with w = — 1 and At = we obtain the marginalized 95% confidence 
result m v < 0.30 eV (the n-D limit is much less sensitive to the parameter space, and the result does not change 
significantly). Thus the simplest model where all the neutrino masses are very small is still a good bet. 

The result for the quintessence parameter w is consistent with w — —1, corresponding to a cosmological constant. 
The marginalized limit is w < —0.75 at 95% confidence, consistent with Ref. [47j. If we neglect the quintessence 
perturbations it is a simple matter to relax the assumption that w > — 1; for flat models with no tensors or massive 
neutrinos we find the marginalized result —1.6 < w < —0.73 at 95% confidence, and the n-D result —2.6 < w < —0.6 
with the best fit close to w = —1, broadly consistent with Ref. flt8[ . Note that including quintessence perturbations 
leads to a tighter constraint on w due to the increased large scale power. Although perturbations are required for 
consistency with General Relativity, it is possible that a quintessence model may be approximated better by a constant 
w model neglecting perturbations than one including the perturbations. 

The constraint on the tensor mode amplitude (encoded by ei) is weak, as expected due to the large cosmic variance 
on large scales. In table [n] we also show the result for rio = Cj^/Cfg, the ratio of the large scale CMB power in tensor 
and scalar modes. For comparison, with perfect knowledge of all the other parameters and a noise-free sky map, the 
CMB temperature power spectrum cosmic variance detection limit is rio > 0.1. 

The method we have used could be generalized for a more accurate parameterization of the initial power spectrum, 
for example going to second order in the slow-roll parameters [|43|j , which in general introduces a running in the 
spectral index. The current data is however clearly consistent with the simplest scale invariant power spectrum with 
no tensor modes. As a check we have generated chains for flat models with a cosmological constant, no massive 
neutrinos or tensors, but allowing for a running spectral index, and found the 68%-confidence marginalized result 
—0.06 < n run < 0.02 at k = k = 0.05Mpc _1 where n rU n = d 2 (In P x ) / d(ln k) 2 . This corresponds to the running 
spectral index n S)e gf(fc) = dlnP x /dhik = n s (fco) + "run ln(fc/feo)- 



Non-flat 11-parameter models 



We now relax the constraint on the curvature, and allow the tensor spectral index to be a free parameter (we 
assume n t < 0.1). We parameterize the power spectra of the initial curvature and tensor metric perturbations as in 
the inflationary case 9 , except that we now report results for A t /A s rather than a slow-roll parameter, and choose the 
scalar and tensor pivot scales fc s o = 0.05Mpc _1 , k t o — 0.002Mpc _1 (the standard CMBFAST parameterization). 

In Figure ^|we show the parameter constraints that we get using about 10000 weakly correlated samples importance 
sampled to include the 2dF data and CBI. For comparison we plot the equivalent constraints with the 9 and 11 



7 Monte-Carlo samples from the posterior do not provide accurate estimates of the parameter best-fit values (in high dimensions the 
best-fit region typically has a much higher likelihood than the mean, but it occupies a minuscule fraction of parameter space) therefore 
we do not quote best fit points. The high-significance limits are also hard to calculate due to the scarcity of samples in these regions. 
To compute accurate estimates in the tails of the distribution and to ensure the tails are well explored, we sample from a broader 
distribution and then importance sample to the correct distribution, by originally sampling from P x / T where T > 1 (we use T = 1.3). 

8 Our strategy of generating chains without the 2dF data and then importance sampling ensures that we have many samples in the tail 
of the distribution, and hence that our upper limit is robust (since we have have much lower Monte-Carlo noise in the tails than if we 
had generated chains including the 2dF data) . and also makes the effect of the 2dF data easy to assess. 

9 For non-flat models our definitions follow |4S|] . In open models we assume the tensor power spectrum has an additional factor of 
tanh(7ry'— k 2 / K — 3/2) on the right hand side. 
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FIG. 5: Posterior constraints for 11-parameter non-flat models (black lines) using all data, compared with 6 (red) and 9 
(green) parameter models. Dotted lines show the mean likelihood of the samples for the 11-parameter model. Some sampling 
noise is apparent due to the relatively small number of samples used. 

parameter models. The additional freedom in the curvature broadens some of the constraints significantly, though 
the fl m and h 2 constraints are quite robust. 

The tensor spectral index is essentially unconstrained, the difference between the mean likelihood and marginalized 
ID plots being due to the assumed flat prior on the tensor amplitude — at very small amplitudes n t could be anything 
and still be undetectable. The 95% marginalized limit on the curvature is —0.02 < Qk < 0.07. Slightly open models 
fit the data marginally better on average, though a flat universe is well within the 68% confidence contour. The limit 
on the equation of state parameter is slightly weakened to w < —0.69, and neutrino mass is now m„ < 0.4 eV at 95% 
confidence. 



Which model fits best? 

We have explored the posterior distribution in various parameter spaces, deriving parameter constraints in the 
different models. Since we obtain only upper limits on f v , w and A t /A s there is no evidence for massive neutrinos, 
w ^ — 1 or tensor modes using current data. 

One can make a more quantitative comparison of the different models by comparing how well each fits the data. 
As discussed in Appendix |C| a natural measure is the mean likelihood of the data obtained for the different models. 
Equivalently, if one chose a random sample from the possible parameter values, on average how well would it fit the 
data? We find that the six, nine and eleven parameter models have mean likelihoods ratios 1 : 0.4 : 0.3 using all 
the data. So by moving away from the basic model we have not increased the goodness of fit on average (rather 
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the reverse), which is not surprising given how well the basic model fits the data. Most of the distributions of the 
additional parameters peak at their fixed values. 

We also considered the probability of each model, found from marginalizing out all parameters (the 'Evidence' as 
e.g. explained in Ref. f3C|). Since the 9 and 11 parameter models are totally consistent with the 6 parameter model 
then it is already clear that using this method will favor the 6 parameter model for any choice of prior. The numerical 
evidence ratio depends very strongly on the prior, and without a well motivated alternative to the null hypothesis 
(that there are only 6 varying parameters), its value is not useful. The mean likelihood of the samples (above) uses 
the posterior as the prior, which is at least not subjective, and has a convenient interpretation in terms of goodness 
of fit. 

Whilst certainly not ruled out, at the moment there is no evidence for observable effects from the more complicated 
models we have considered. Nonetheless, when considering parameter values, it is important to assess how dependent 
these are on the assumptions, and this can be seen by comparing the results we have presented. We conclude that at 
the moment simple inflationary models with small tilt and tensor amplitude (e.g. small field models with a nearly flat 
potential; or observationally equivalently, ekpyrotic models) account for the data well. On average the fit to the data 
is not improved by adding a cosmologically interesting neutrino mass or by allowing the dark energy to be something 
other than a cosmological constant. 

IV. CONCLUSIONS 

In this paper we have demonstrated the following benefits of sampling methods for cosmological parameter estima- 
tion: 

• The practicality of exploring the full shape of high-dimensional posterior parameter distributions using MCMC. 

• The use of principle component analysis to identify well constrained non-linear combinations of parameters and 
identify degeneracies. 

• Simple calculation of constraints on any parameters that can be derived from the base set (e.g. age of the 
universe, as, ?"io, etc.). 

• Use of the mean likelihood of the samples as an alternative to marginalization to check robustness of results 
and relative goodness of fit. 
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• The calculation of extremal values within the n-D hyper-surface to better represent the range of the full proba- 
bility distribution. 

• The use of importance sampling to quickly compare results with different subsets of the data, inclusion of new 
data, and correction for small theoretical effects. 



Our Monte-Carlo code and chains are publicly available at tittp : //cosmologist . inf o/cosmomc. 

With the current cosmological data we found that the Monte-Carlo approach works well, though simply picking 
the best fit sample does not identify best- fit model to high accuracy (and therefore we do not quote these numbers), 
and there are potential difficulties investigating posterior distributions with multiple local minima (although this is 
not a problem given the parameters and data used here). 

We investigated a 6-D cosmological parameter space and found, for the first time, a concise description of the CMB 
constraints on the matter power spectrum normalization, in addition to tight constraints on fib and n s in agreement 
with previous analyses. The new information from the CBI and VSA interferometers is in good agreement with the 
older data points and we find that our results are negligibly changed on removing this information. 

On adding in constraints from a wide range of cosmological data we evaluated constraints on the above 6 parameter 
model as well as extending to more complicated 9 and 11 parameter models. Many of the constraints on the base 
set of parameters were fairly robust to the addition of this extra freedom, for example the matter density changed 
from fi m = 0.30 ± 0.04 for the basic 6 parameter model to fi m = 0.28 ± 0.07 for the 11 parameter model (68 per 
cent confidence). On the other hand the value for the matter power spectrum normalization on 8/i -1 Mpc scales is 
quite dependent on the neutrino mass, and allowing for a significant neutrino mass decreases the mean value of as 
(the constraint on the amplitude could be improved by better constraints on the small scale CMB amplitude and the 
reionization redshift.) Parameters affecting or sensitive to the late time evolution tend to be rather degenerate, and 
constraints on these are considerably weakened on adding additional freedom in the model. 

We find that the 9 parameter model is quite well constrained by the amount of data used and obtain upper limits 
on a number of interesting cosmological parameters, given our assumptions of a flat universe with slow-roll inflation 
constraints. In particular we find the marginalized constraint m„ < 0.3 eV on the neutrino mass and w < —0.75 for 
the equation of state parameter (95 per cent confidence). There is no evidence for tensor modes, though the constraint 
is currently quite weak, with the constraint on the ratio of the large scale CMB power being rio < 0.7. This constraint 
could be sharpened considerably by restricting the allowed range of scalar spectral indices and neutrino masses. 

In the 11 parameter space the limits are weakened slightly and the standard cosmology of w = —1 and Q,k = is 
near the peak of the posterior probability. The tensor spectral index is essentially unconstrained as expected given 
that the only information comes from the large scale (COBE) CMB data. 

While a detailed investigation of the effect of using all the different combinations of cosmological constraints is 
beyond the scope of this paper we do show the effect of removing the second most powerful constraint (the galaxy 
power spectrum) on the 9 parameter model in Figure || The limits on most of the parameters are affected remarkably 
little. The neutrino mass is the most affected, with the upper limit doubling on removing 2dF. The neutrino mass is 
correlated with the matter power spectrum shape parameter (roughly fi m /i) and amplitude, and these constraints are 
correspondingly weakened on removing 2dF. 

As new better data become available our general method should also be applicable into the future. Due to the enor- 
mously decreased number of likelihood evaluations in the MCMC method compared to other approaches, theoretical 
predictions can be computed essentially exactly, and one can account for the available data in detail. 
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APPENDIX A: THE METROPOLIS-HASTINGS ALGORITHM 



The algorithm that we use for generating samples from the posterior distribution using a Markov Chain is the 
Metropolis-Hastings algorithm. For an introduction and overview of MCMC methods see Ref. |ll], |l2|, |l3| . A Markov 
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Chain moves from a position in parameter space 8\ to the next position 62 with transition probability T (81,62), 
where 9 labels a vector of parameter values. The Metropolis-Hastings transition kernel T(6i,6 2 ) is chosen so that 
the Markov Chain has a stationary asymptotic distribution equal to P{9), where P(0) is the distribution we wish 
to sample from. This is done by using an arbitrary proposal density distribution q(9 n , 9 n+ i) to propose a new point 
6 n+ \ given the chain is currently at 9 n . The proposed new point is then accepted with probability 



fn n v • L P ( d n + l)q(8n+l,0n) \ 

a(e n ,8 n+ x) =mm|l, p^^^ ) 



(Al) 



so that T(d n ,d n+ i) = a(9 n , 9 n+ i)q(9 n , 6 n+ i) . This construction ensures that detailed balance holds, 

P(8 n+1 )T(9 n+1 ,9 n ) = P(9 n )T(9 n , 9 n+1 ), (A2) 

and hence that P(9) is the equilibrium distribution of the chain. 

If the chain is started in a random position in parameter space it will take a little time, burn in, to equilibrate 
before it starts sampling from the posterior distribution. After that time each chain position is a correlated sample 
from the posterior. The correlation is particularly obvious if the proposal is not accepted as then there are two or 
more samples at exactly the same point. However by using only occasional chain positions (thinning the chain) one 
can give the chain time to move to an uncorrelated position in parameter space, and independent samples are then 
obtained. Small residual correlations between samples are unimportant for almost all calculations, though they do 
make the Monte-Carlo error on the results harder to assess. 

For the cases we consider the chains equilibrate rapidly, at worst after thousand or so points. The results can be 
checked easily by using a longer burn in and comparing results. We thin the chain positions by a factor of 25-50 
depending on the number of parameters, leaving weakly correlated samples that we use for importance sampling (see 
Appendix |b|). 

If the proposal density is symmetrical it cancels out when working out the acceptance probability, which then 
becomes just the ratio of the posteriors. This is the case when the proposal density is independent of the current 
position of the chain, which is the case we consider. 



The proposal density 

The choice of proposal density can have a large effect on how the algorithm performs in practice. In general it 
is best to have a proposal density that is of similar shape to the posterior, since this ensures that large changes are 
proposed to parameters along the degeneracy directions. Fortunately with cosmological data we have a reasonable 
idea of what the posterior might look like, and so choosing a sensible proposal density is not difficult. 

If posteriors from models with common parameters are much easier to compute it can be beneficial to use a proposal 
density that changes only a subset of the parameters on each iteration, ensuring that consecutive posterior evaluations 
only differ in a subset of the parameters. Proposing a change to a random subset of the parameters also increases 
the acceptance rate, especially in high dimensions, giving faster piecewise movement around parameter space. In the 
case of CMB parameter estimation, models that differ only by a different normalization of the theoretical CMB power 
spectrum arc very quick to compute once the Ci values for a single model have been calculated. Similarly changing 
parameters that govern calibration uncertainties in the data can also be very quick. However changing parameters 
that govern the perturbation evolution, for example f2 c , etc, will be much slower as in general it requires a detailed 
recalculation of the linear physics. 

If we are comparing CMB data with theoretical models, the most general way to compute the theoretical Ci power 
spectrum is using a fast Boltzmann code such as CAMB |ll| (a parallelized version of CMBFAST Jl7[ ; we discuss less 
accurate and general schemes below). Since the perturbation evolution is assumed to be linear, any parameters 
governing the initial power spectra of the scalar and tensor perturbations will be fast to compute once the transfer 
function for each wavenumber has been computed. Parameters governing the initial power spectrum are therefore 
'fast' parameters. 

We therefore use a proposal density that makes changes only within the subsets of the fast and slow parameters, 
at least when we do not have an approximate covariance matrix available for the posterior. 10 We made the fairly 
arbitrary choice to change a subset of one to three parameters at a time, cycling through the parameters to be changed 



When changing the slow parameters it is possible to also change the fast parameters at the same time. This can be a good idea when 
there are highly correlated slow and fast parameters, for example the reionization redshift and the tensor amplitude. 
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in random order, which gives a high acceptance rate (~ 50%) for the cases we considered. After one initial run one 
can transform to a set of parameters which diagonalize the covariance matrix before doing subsequent runs, allowing 
efficient exploitation of degeneracy information as long as the posterior is reasonably Gaussian. 

The above scheme is sufficient for parameter estimation from current data, however as more data becomes available 
the posterior may become highly non-Gaussian or disjoint, in which case it may become necessary to use more 
sophisticated schemes using simulated annealing, hybrid Monte Carlo, or schemes using cross-chain information |Tl| , 
|l2| , However when the posterior is not disjoint one can often transform to a set of base parameters which are 
relatively independent, in which case a simple Monte-Carlo scheme should continue to work well (see Appendix ^ for 
further discussion). 



APPENDIX B: IMPORTANCE SAMPLING 



Given a set of samples from a distribution P, one can estimate quantities with respect to a different similar 
distribution P' , by weighting the samples in proportion to the probability ratios. This effectively gives a collection 
of non-integer weighted samples for computing Monte-Carlo estimates. For example the expected value of a function 
f{9) under P' is given by 



</(0)> P , = J dOP'(8)f(8) = J dO^±P(0)f(0) 



P{6) 

P -m m ) P (B1) 

Given a set {9 n } of N samples from P a Monte-Carlo estimate is therefore 

</<•»- -4 £^«*->- < B2 » 

n— 1 v 1 

For this to work it is essential that P/P' is never very small, and for a good estimate without massively oversampling 
from P one needs P'/P ~ constant everywhere where P' is significantly non-zero. If P' is non-zero over only a very 
small region compared to P it will be necessary to proportionately oversample from P. 

If the distributions are not normalized, so that J d0P(6) — Z, the ratio of the normalizing constants can be 
estimated using 



z' _ /p(oy\ i A p'{o n ) 



Z \P(0)/ P N^P(G n ) 

and hence 



(B3) 



{m) „ En=iP'^n)/P(e n ).f(e n ) _ 

In Bayesian analysis it can be useful to compute the ratio of the evidences P(D) = J d6P(D, 9), given as above by 



p'{D) _ ( P'{e.p) \ i " p>{D\e n )p'(e n ) 

p(d) \ p(e, d) / mm ~ n ^ P(D\e n )P(6 n ) ' 1 ' 

where the samples {9 n } are drawn from P(0\D). Assuming the distributions are sufficiently similar, the evidence 
under P' can therefore easily be computed from the probability ratios at a sample of points under P, and a known 
evidence under P. In many cases only the ratio is of interest — the ratio is larger than one if on average the probability 
of the samples under P' is higher than under P. In the case where the distributions are very different one may need 
to introduce a series of intermediate distributions that are all not too dissimilar to each other, and perform Monte 
Carlo simulations for each. The evidence ratio one requires is then just the product of that for all the intermediate 
distributions. Many more general schemes are described in |L3[ |5l| , though in this paper we only consider importance 
sampling to similar or subset distributions. 

The simplest application of importance sampling is to adjust results for different priors. For example if one computes 
a chain with flat priors on the parameters, one may wish to importance sample to several different distributions with 
different priors on various parameters. This will work well as long as the prior does not skew the distribution too 
much or give non-zero weight to only a very small fraction of the models. 
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Faster Monte-Carlo 



MCMC runs produce correlated samples from the probability distribution. To obtain independent samples one 
thins out the chain by a sufficiently large factor that the chain has had time to move to a randomly different point 
between the thinned samples. Depending on how one implements the MCMC, the shape of the posterior and the 
number of dimensions the thinning factor can be quite large, typically of the order ten to a thousand. 

By performing Monte-Carlo simulations with a good approximation to the true probability distribution one can 
use importance sampling to correct the results with an accurate calculation of the probabilities. This can be useful 
if computing the probabilities accurately is much slower than computing an approximation, since one only ever 
importance samples independent samples. The burn-in and random walking stages of the Monte-Carlo involve a much 
larger number of probability evaluations, so using a fast approximation when generating the chain saves a lot of time. 

Calculating the posterior from CMB data requires a calculation of the theoretical CMB power spectra, C/. Using 
accurate codes like CAMB and CMBFAST is typically much slower than computing the likelihoods from the data once 
the Ci are known (assuming one uses a radical data-compression scheme, e.g. see Ref. 25 ). In the not so distant 
future we will require to high accuracy C; up to I ~ 2500, including second order effects such as lensing, and also the 
matter power spectrum at various redshifts. Without access to a fast supercomputer this may be prohibitive. 

With a small number of parameters it is possible to use a grid of models and interpolate to generate accurate Ci 
quickly, however as the number of parameters grows the computational cost of computing the grid grows exponentially. 
Also, as second order effects such as gravitational lensing become important, fast grid generation schemes such as the 
fc-splitting scheme of Ref. |^2| become much more difficult to implement accurately. However these may still be useful 
as a fast approximation, as long as the independent samples are corrected with a more accurate calculation. Ref. 
describe a scheme for generating C;s very quickly from a small number of base models, a set of optimized parameters, 
and an accurate calculation of how the Ci vary with these parameters. This gives a very fast approximator over a 
restricted range of parameters that may prove useful combined with importance sampling correction. 

It is also possible to use fast semi-analytic schemes. Typically these are based on a smallish grid of base models, 
from which the C/S in general models are computed quickly on the fly by accounting for changes in the angular 
diameter distance to last scattering, differing perturbation growth rates, etc. These approximate schemes can be 
made quite accurate at at small scales, with significant errors mainly at low I, precisely where the cosmic variance 
is large. So whilst an approximate scheme may produces small systematic errors in the likelihood, if the error is of 
the same order as the cosmic variance or less, the probabilities given the data are bound to be sufficiently similar for 
importance sampling to be valid. 

A particular approximate C\ generator we have tried is CMBfit M , which uses of combination of base C; grids and 
analytic fits. This achieves quite good few percent level accuracy at high I, though larger systematic errors at low 
I. However the code is fast, and we found that importance sampling the results with an exact calculation of the C; 
gives good results, and removes systematic biases introduced by the low I approximations. Such an approach can 
be generalized for more general late time evolution, for example models with quintessence where the effect on small 
scales is due almost entirely to changes in the background equation of state. 

An alternative scheme based on grids of the transfer functions for each wavenumber can produce more accurate 
results, like the recently released DASH [54}| . However this is not much faster than generating the C/S exactly using 
CAMB on a fast multi-processor machine, and relies on a large pre-computed grid (which introduces its own limitations). 
The only real advantage over CMBfit is that more general initial power spectrum parameterization could be accounted 
for easily — something that is impossible with schemes based on grids of C/s. 

Even without a fast semi-analytic scheme, there are a variety of small corrections that can be applied post hoc. For 
example lensing affects the CMB Ci at the few percent level, so one may wish to compute chains without including the 
lensing, then importance sample to correct the results using an accurate calculation including the lensing 11 . For small 
scales at high precision one may also wish to run CAMB at a high-accuracy setting to check that numerical errors in 
the default output are not affecting the results. Also chains could be generated to lower I and the effect of the high-Z 
constraints accounted for by importance sampling. For example we generated the chains using Z max = 1300, and then 
for the nearly independent samples re-computed the power spectra up to / max = 2000 for importance sampling with 
the CBI data. 

Similar methods could be applied for the matter power spectrum using approximate fittings, see e.g. Refs. p2| , p6| . 
However when a fast multi-processor machine is available, and one is interested in a very large number of parameters, 



However if one is also computing the matter power spectrum numerically the additional cost of including the lensing effect is small. 
We have checked that the lensing correction to the results we present is much smaller than the errors (the lensed power spectra can be 
computed with CAMB using the harmonic approach of Ref. pa]). 
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it is much simpler to Monte-Carlo using CAMB to generate the CMB power spectra and matter power spectrum, which 
is what we did for the results we present. The great advantage of this approach is that it generalizes trivially if one 
wishes to include changes in the physics, for example different quintessence models, or changes in the initial power 
spectrum. 

Constraints with new data 

Assuming that one has some new data which is broadly consistent with the current data, in the sense that the 
posterior only shrinks, one can use importance sampling to quickly compute a new posterior including the new data. 
We have made our MCMC chains publicly available, so these can be used to rapidly compute new posteriors from new 
data without incurring any of the considerable computational cost of generating the original chain. For example if you 
have a new constraint on ag, you just need to loop over the samples adjusting the weights of the samples proportional 
to the likelihood under the new constraint. Using importance sampling has the added benefit of making it very easy 
to assess how the new data is changing the posterior. 

APPENDIX C: PARAMETER CONSTRAINTS 

The great advantage of the Monte-Carlo approach is that you have a set of samples from the full parameter space. 
To answer any particular question one can examine the points and compute results reliably, taking full account of the 
shape of the posterior in N dimensions. However for human consumption it is usual to summarize the results as a set 
of parameter values and error bars. 

One way to do this is to use the samples for a principle component analysis to identify the degeneracy directions, as 
we demonstrated in Section ^. By quoting constraints on a set of orthogonalized parameters one retains most of the 
information in the original distribution, as long as it is sufficiently Gaussian (or Gaussian in the log, or some other 
function). However ultimately one is usually interested in the values of some fundamental parameters, and it is also 
useful to find constraints on these alone. 

The simplest approach is to compute the marginalized 1-dimensional distributions for each parameter, essentially 
counting the number of samples within binned ranges of parameter values. Note that this is extremely hard to do 
using a brute-force numerical grid integration calculation as it scales exponentially with the number of dimensions, 
but is quite trivial from a set of Monte-Carlo samples. One can then quote the value at the maximum or mean of 
the ID distribution, along with extremal values of the parameter which contain a fraction / of the samples, where / 
defines the confidence limit. The extremal values could be chosen so that there were the same number of outliers at 
both ends of the distribution, or such that the value of the marginalized probability is the same at each limit. This is a 
good way of summarizing the current state of knowledge as long as you have included everything you know, including 
using a believable prior over parameter space. 

However, frequently one wants to use the parameter estimates to assess consistency with new data or theories, and 
the prior can be very hard to define. For example, on putting in a top hat prior on the age and h, the marginalized 
prior probabilities are not flat, even if all of the other priors are flat broad top hats. This is because the marginalized 
distribution includes the effect of the amount of parameter space available at each point, which can depend quite 
strongly on the value of the parameter. Likewise it is possible to have a region in parameter space which fits the data 
rather well, but because the region is small the marginalized probability of those parameter values can be very low. 

When assessing consistency with new data (or theories), one really wants to know whether the posterior for the 
new data intersects the ./V-dimensional posterior for the current data in a region where both are likely. For example 
one could define the region of parameter space enclosing a fraction / of the points with the highest likelihood as the 
A^-dimensional confidence region, and then see whether this region intersects with the corresponding region for the 
new data, ft is clearly sub-optimal to try to perform this comparison using only ID parameter values and limits, 
however if one quotes the extremal values of each parameter contained in the TV-dimensional confidence region it is 
at least possible to assess whether the iV-dimensional regions might overlap. At least if the new data is outside these 
limits it is a clear indication that there is an inconsistency, whereas using the marginalized limits it shows no such 
thing (just that if there is a consistent region it makes up a small fraction of the original parameter space — something 
one would hope for if the new data is informative!) However it is of course easily possible for the ID likelihood limits 
to be consistent but the full A^-dimensional regions to be highly inconsistent. 

In order to be as informative as possible it can be useful to quote both the marginalized and likelihood limits, 
though of course one should study the full set of samples to make use of as much information as possible. When there 
are strong degeneracies one can quote the constraints on the well-determined orthogonalized parameters. 
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Mean likelihoods 

Often it is useful to show the projected shape of the distribution in one or two dimensions. The marginalized 
distribution, proportional to the number of samples at each point in the projected space, gives the probability density 
in the reduced dimensions, ignoring the values of the parameters in the marginalized dimensions, and is therefore 
usually the quantity of interest. However this looses all the information about the shape of the distribution in the 
marginalized directions, in particular about the goodness of fit and skewness with respect to marginalized parameters. 
Useful complementary information is given by plotting the likelihood of the best fit model at each point, for example 
see Ref. [|J. However it is not so easy to compute this using a small set of Monte-Carlo samples — mean values 
within each bin can be obtained quite accurately from a small number of samples, but getting a good value for the 
maximum in each bin requires a much larger number. Instead we plot the mean likelihood of the samples at each 
value of the parameter, which is easy to compute from the samples. It shows how good a fit you could expect if you 
drew a random sample from the marginalized distribution at each point in the subspace. 

From a distribution P(9) one can derive the (marginalized) distribution of a derived parameter vector of interest 
v = h{9) by 

P(v) = M(P,v) = J A9P{9)5{h{9)-\). (CI) 
Assuming flat priors on 9 the expected mean likelihood of samples with h(9) = v is 

(P(9 ■ h(9) v)) JMPW 2 mO)-v) M(P',v) 

(P(9 . h(9) v)) = j dep(mm _ v) M(p>v) ■ (C2) 

Frequently h(9) is a projection operator into a subspace of 9 (for example h(9) — Q\ for marginalization down to the 
first parameter). If this is the case and P{9) is a multivariate Gaussian distribution, the marginalized distribution 
M(P, v) is also a Gaussian (readily proved using Fourier transforms; the covariance is given by the projected covariance 
matrix). Since the square of a Gaussian is a Gaussian it follows that M(P 2 ,v) oc M(P, v) 2 , and hence the mean 
likelihood is proportional to the marginalized distribution M{P, v). This also follows trivially if P is separable with 
respect to the subspace. In the case of Gaussian or separable distributions the mean likelihood curve is therefore 
proportional to the marginalized distribution and the two curves look the same. Differences in the curves therefore 
indicate non-Gaussianity, for example when one of the marginalized parameters is skewing the distribution in a 
particular direction (for example the effect of massive neutrinos f v > on the ag curve in Figure |3[ if f v was fixed at 
it's maximum likelihood value the marginalized result for erg would change significantly in the direction of the mean 
likelihood curve). The converse does not hold of course, it is possibly to have a non-Gaussian distribution where both 
curves are the same. If the priors on the parameters are not flat this will also show up as differences in the curves 
even if the likelihood distribution is Gaussian. 

Effect of the prior 

In our analysis we chose a particular set of base parameters which were assigned flat priors. This choice was fairly 
arbitrary, and there are other possible choices. For example one might instead use Q\ as a base parameter and derive 
h from the constraint 



Q h h 2 + Q c h 2 



i - n A - n 



K 



(C3) 



In this case the prior on h is given by 

p(h, n h h 2 , n c h 2 ,n K ) = p(n A> n h h 2 , n c h 2 , n K )^- = 2^p(fi A) n^^h 2 ^), (C4) 

and so the prior on h is proportional to 0, m /h if the prior on f2 A is flat. Using ftasa derived parameter therefore tends 
to give results which favor lower h values and higher f2 m values. Using importance sampling it is straightforward to 
adjust results from one set of base parameters to another by weighting the samples by the corresponding ratio of the 
priors 12 . 



12 However the tails of the distributions can change significantly, so it may be necessary to generate the original chain at a higher 
temperature to importance sample accurately. In this example we generated the chain at a temperature of 1.3, so the samples were 
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FIG. 7: Parameter constrains from the CMB alone with flat prior on h (£2a derived, black lines) and flat prior on Qa (h 
derived, red lines). Dotted lines show the mean likelihood of the samples, solid lines the estimated marginalized distribution. 
In most cases both sets of dotted lines are almost on top of one another. 



For the results of the parameter estimation to be meaningful it is essential that the priors on the base set are well 
justified, or that the results are independent of the choice. In Figure ^| we show the effect on the posterior constraints 
from the CMB data from the 6 parameter analysis using different base sets. The distributions shift by a fraction of 
their width, though this can have quite a large effect on the high-significance limits of weakly constrained parameters 
(for example the 95% confidence limit is h < 0.89 with h a base parameter, h < 0.59 with a base parameter). 

For well constrained parameters the prior effectively becomes flatter over the region of interest and the effect is 
much less significant. As shown on the right of Figure |^ the posteriors of four parameters that are well constrained 
by the CMB are almost independent of the choice of prior. 

As shown in Figure |c| plotting the mean likelihood of the samples gives a clear indication of the direction in which 
results may be biased relative to a different choice of prior. It is also clear that by choosing h as a base parameter we 
are getting more samples in the region of interest for comparison with other data. In particular using Q\ as a base 
parameter gives a sharp cut-off at the higher values of h, which are allowed by the HST prior. One slight disadvantage 
of using h rather than Q\ is that the correlation of h with some of the other base parameters is significant, which 
may make the Monte-Carlo sampling less efficient. However since we use the covariance matrix to rotate to a set of 
orthogonalized parameters after one short initial run this is not a major problem. 

The efficiency of the MCMC implementation can be improved by using a set of parameters for which the posterior 
is as symmetric as possible |33|. It may therefore be a good idea to transform to a better set of base parameters, 
for example one could transform to a set of orthogonalized parameters derived from a principle component analysis 
using some less constraining data. However when performing a non-linear transformation of parameters it is also 
necessary to transform the flat priors on the parameters to obtain equivalent results. If one assumes flat priors in the 
transformed parameters it is wise to check that this does not induce a strong prior bias on the cosmological parameters 
of interest. 



APPENDIX D: GOODNESS OF FIT 

To consider whether an enlarged parameter space is justified, one ideally wants to compare the evidences P(D) with 
the different parameter sets. In some cases, for example when using hyperparameter weights on experiments, it may 
be possible to define a prior on the extra parameters in which case one can compute the evidence ratio directly. The 
ratio does however depend quite strongly on the prior put on the parameters, which in general it is not straightforward 



drawn from p 10 / 13 and then importance sampled. 
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to quantify. If one puts a broad prior on a parameter, but the likelihood is much narrower, the probability of the 
data is down-weighted because the likelihood takes up a much smaller region of parameter space. One simple, but 
non-Bayesian, way to get round this is to set the prior equal to the normalized posterior for computing the evidence, 
in which case one compare the values of 

, . f d6P(D\0)P(9\D) 1 , , , 

This is just the expected probability of the data in the posterior distribution, which can be estimated trivially from a 
set of Monte-Carlo samples as the mean likelihood of the samples. For Gaussian distributions this is the exponential 
mean of the chi-squareds under the posterior distribution, and is thus a smeared-out version of the common practice 
of quoting the chi-squared of the best fit. The smearing out helps to down- weight extra parameters which have to 
be fine tuned to obtain better fits. If the mean likelihood is bigger with the extra parameters it suggests they are 
improving the fit to the data on average. Although we know no way to use the value rigorously for hypothesis testing, 
it seems nonetheless to be useful as a rule of thumb measure of goodness of fit. 

APPENDIX E: CONSISTENCY OF DATA SETS 

It is important to assess whether the datasets being used are consistent, or whether one or more is likely to be 
erroneous. This can be done by introducing hyperparameter weights on the different datasets |37[ |58f when performing 
the analysis. If a dataset is inconsistent, its posterior hyperparameter will have a low value, and the dataset then only 
contributes weakly to the posterior probability of the parameters. In the case that the likelihoods are of Gaussian form 
it is a simple matter to marginalize over the hyperparameters analytically given a simple prior. To assess whether the 
introduction of hyperparameters is justified (i.e. whether the data are inconsistent with respect to the model), one 
can compare the probability of obtaining the data in the two hypotheses: Hq, no hyperparameters are needed; Hi, 
hyperparameters are needed because one or more datasets are inconsistent. Using a maximum entropy prior assuming 
that on average the hyperparameter weights are unity, Ref. Q gives 

P(D\6,Hi) _ n 2"^+TK/2 + 1) 

P(D\0,H o ) 11 {x 2 +2)e -xl/2 ' ^ > 

where k labels the datasets, each containing n& points. G iven a set of independent samples from H it is straightforward 



to compute an estimate of the evidence ratio using Eq. ( B5 ) . If the datasets are inconsistent the importance sampling 



estimate would be very inaccurate as the probability distributions would be significantly different. However this 
should be clear when one computes the estimate since the probability ratios will vary wildly. If one suspects that one 
of the datasets is inconsistent it would be better to start with sampling from Hi , and confirm that the evidence ratio 
supports using the hyperparameters. 

An even simpler way of assessing consistency of the datasets might be to compare the mean likelihood of the samples 
in the region of overlap of the posterior distributions to the overall mean likelihood under the original posterior. If 
the mean likelihood of the samples in the region of overlap is much less than the original mean, it is an indication 
than the regions of high likelihood under each dataset do not overlap well in TV-dimensions, and hence there may be 
an inconsistency. In practice the samples in the region of overlap can be found by importance sampling additional 
datasets. The mean likelihoods should always be computed with respect to the same, original, dataset (or group of 
datasets) . However importance sampling may fail to identify inconsistencies in particular cases when the distributions 
have multiple maxima. 

APPENDIX F: ANALYTIC MARGIN ALIZATION 

Frequently one has data in which there is an unknown calibration uncertainty, or an unknown normalization. These 
parameters can be marginalized over analytically following p6fl as long as the likelihoods are Gaussian, and the prior 
on the amplitude parameter is Gaussian or flat. Typically one has a marginalization of the form 

Loc J daP(a)cxp[-(av-d) T N- 1 {av-d)/2] (Fl) 

where v and d are vectors, TV is the noise covariance matrix, and P(a) is the prior. For example for the supernovae 
data v is assumed to be a vector of equal constants giving the intrinsic magnitudes of the supernovae, and d is a 
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vector of the theoretical minus the observed effective magnitudes. If the prior P(a) — const it clearly cannot be 
normalized, however the marginalization is trivial giving 

/ /V _1 vv T /V _1 \ 

-2\nL = d T AT" 1 d + ln(v T JV _1 v) + const. (F2) 

\ v J N l v J 

In the case that v is a constant (independent of the data and parameters), one has L oc e~ x rff/ 2 where 

Xe 2 ff = d T (TV" 1 - ^TN-Z 1 ) d = Xl -^ (F3) 

This is exactly the same as the best-fit one obtains by minimizing the likelihood w.r.t. a, and so in this case the 
maximization technique of Ref. [Q] is exactly equivalent to full marginalization. For example, in the case of the 
supernovae data, marginalization with a flat prior over the magnitudes is equivalent to using the best-fit magnitude. 
In general this is not true as the logarithmic dependence ln(v T 7V -1 v) can depend on the parameters. For example 
with the 2dF data v would be the predicted matter power spectrum values, and a would be the unknown amplitude 
relative to the galaxy power spectrum at z = 0.17. 

The marginalized result is only 'correct' if the assumed flat prior is correct; it is an advantage of the maximization 
technique that the result does not depend on the prior. 
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